18 May 2017
Do other time series analysis
But also other types of analysis that involve processing timestamp data.
library(padr)
library(dplyr)
padr::emergency %>% head
## # A tibble: 6 x 6
##        lat       lng   zip                   title          time_stamp
##      <dbl>     <dbl> <int>                   <chr>              <dttm>
## 1 40.29788 -75.58129 19525  EMS: BACK PAINS/INJURY 2015-12-10 17:40:00
## 2 40.25806 -75.26468 19446 EMS: DIABETIC EMERGENCY 2015-12-10 17:40:00
## 3 40.12118 -75.35198 19401     Fire: GAS-ODOR/LEAK 2015-12-10 17:40:00
## 4 40.11615 -75.34351 19401  EMS: CARDIAC EMERGENCY 2015-12-10 17:40:01
## 5 40.25149 -75.60335    NA          EMS: DIZZINESS 2015-12-10 17:40:01
## 6 40.25347 -75.28324 19446        EMS: HEAD INJURY 2015-12-10 17:40:01
## # ... with 1 more variables: twp <chr>
coffee
##            time_stamp amount
## 1 2016-07-07 09:11:21   3.14
## 2 2016-07-07 09:46:48   2.98
## 3 2016-07-09 13:25:17   4.11
## 4 2016-07-10 10:45:11   3.14
Every row is a single observation, typically at the level of seconds. You want to do the analysis at a (much) higher level.
padr offers: thicken. It is used in conjunction with a data manipulation package, like dplyr or data.table.
emergency %>%
  thicken(interval = "month") %>%
  count(time_stamp_month) %>%
  head()
## # A tibble: 6 x 2
##   time_stamp_month     n
##             <date> <int>
## 1       2015-12-01  7969
## 2       2016-01-01 13205
## 3       2016-02-01 11467
## 4       2016-03-01 11101
## 5       2016-04-01 11326
## 6       2016-05-01 11423
When there is no observation, there is no record.
padr offers: pad.
data.frame(dt  = as.Date(c("2017-02-23", "2017-02-26")),
           val = c(2, 4)) %>%
  pad(interval = "day")
## pad applied on the interval: day
##           dt val
## 1 2017-02-23   2
## 2 2017-02-24  NA
## 3 2017-02-25  NA
## 4 2017-02-26   4
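To see what padding amounts to, here is a minimal base R sketch of the same idea. It is an illustration only, not how padr implements pad; the names all_days and padded_base are made up for this example.

```r
# Base R sketch of the idea behind padding (not padr's implementation).
df <- data.frame(dt  = as.Date(c("2017-02-23", "2017-02-26")),
                 val = c(2, 4))

# Every day between the first and the last observation.
all_days <- data.frame(dt = seq(min(df$dt), max(df$dt), by = "day"))

# Left join: days without an observation get NA for val.
padded_base <- merge(all_days, df, by = "dt", all.x = TRUE)
padded_base
```

pad saves you from writing this by hand, and it also works out the interval for you.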
Think of time data as having a heartbeat. It produces data at a certain interval.
padr currently uses eight intervals: year, quarter, month, week, day, hour, minute, and second.
get_interval(emergency$time_stamp)
## [1] "sec"
The interval is the highest of the eight that can explain all the instances observed in the data.
dt <- as.Date(c("2017-02-23", "2017-02-26", "2017-02-27"))
all(dt %in% seq(dt %>% min, dt %>% max, by = 'day'))
## [1] TRUE
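The same check can be repeated from the highest interval downwards. The helper below is a hypothetical sketch for Date vectors only; it is not padr's code, and get_interval() remains the function to use in practice.

```r
# Hypothetical helper, for illustration only (Date input, five of the intervals).
highest_interval <- function(x) {
  candidates <- c("year", "quarter", "month", "week", "day")  # high to low
  for (candidate in candidates) {
    grid <- seq(min(x), max(x), by = candidate)
    # The first interval whose grid contains every observation wins.
    if (all(x %in% grid)) return(candidate)
  }
  NA_character_
}

dt <- as.Date(c("2017-02-23", "2017-02-26", "2017-02-27"))
highest_interval(dt)
```

Week is rejected here because a weekly grid starting at 2017-02-23 does not contain 2017-02-26, so the search falls through to day.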
This week v0.3.0 came out on CRAN. The notion of the interval is widened: it now allows units other than 1 within each interval.
as.Date(c("2017-05-12", "2017-05-14", "2017-05-18")) %>%
  get_interval()
## [1] "2 day"
as.POSIXct(c("2017-05-14 09:00:00", "2017-05-14 09:00:05",
             "2017-05-14 09:00:20")) %>%
  get_interval()
## [1] "5 sec"
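One way to think about the unit is as the greatest common divisor of the gaps between consecutive observations. The sketch below is an assumption made for illustration, not a description of padr's internals, and it only covers fixed-length units such as days and seconds. The gcd helper is hypothetical, since base R does not ship one.

```r
# Hypothetical helper: greatest common divisor by Euclid's algorithm.
gcd <- function(a, b) if (b == 0) a else gcd(b, a %% b)

days <- as.Date(c("2017-05-12", "2017-05-14", "2017-05-18"))
day_unit <- Reduce(gcd, diff(as.numeric(days)))   # gaps of 2 and 4 days

secs <- as.POSIXct(c("2017-05-14 09:00:00", "2017-05-14 09:00:05",
                     "2017-05-14 09:00:20"), tz = "UTC")
sec_unit <- Reduce(gcd, diff(as.numeric(secs)))   # gaps of 5 and 15 seconds

c(day_unit = day_unit, sec_unit = sec_unit)
```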
The thicken function takes in a data frame, then it adds a column with the datetime variable rounded to a higher interval:
coffee %>% thicken(interval = "day")
##            time_stamp amount time_stamp_day
## 1 2016-07-07 09:11:21   3.14     2016-07-07
## 2 2016-07-07 09:46:48   2.98     2016-07-07
## 3 2016-07-09 13:25:17   4.11     2016-07-09
## 4 2016-07-10 10:45:11   3.14     2016-07-10
coffee %>% thicken(interval = "15 min", rounding = "up")
##            time_stamp amount   time_stamp_15_min
## 1 2016-07-07 09:11:21   3.14 2016-07-07 09:15:00
## 2 2016-07-07 09:46:48   2.98 2016-07-07 10:00:00
## 3 2016-07-09 13:25:17   4.11 2016-07-09 13:30:00
## 4 2016-07-10 10:45:11   3.14 2016-07-10 11:00:00
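The rounding in the two outputs above can be mimicked in base R. This is a sketch of the arithmetic only, not padr's implementation. The coffee data is rebuilt by hand as coffee_base, and the UTC time zone is an assumption made so the example is reproducible.

```r
# Hand-built copy of the coffee data (time zone assumed to be UTC).
coffee_base <- data.frame(
  time_stamp = as.POSIXct(c("2016-07-07 09:11:21", "2016-07-07 09:46:48",
                            "2016-07-09 13:25:17", "2016-07-10 10:45:11"),
                          tz = "UTC"),
  amount = c(3.14, 2.98, 4.11, 3.14)
)

secs  <- as.numeric(coffee_base$time_stamp)  # seconds since 1970-01-01
width <- 15 * 60                             # 15 minutes in seconds

# Round down and up to the nearest 15-minute boundary.
coffee_base$down_15_min <- as.POSIXct(floor(secs / width) * width,
                                      origin = "1970-01-01", tz = "UTC")
coffee_base$up_15_min   <- as.POSIXct(ceiling(secs / width) * width,
                                      origin = "1970-01-01", tz = "UTC")

# Rounding down to the day drops the time part.
coffee_base$day <- as.Date(coffee_base$time_stamp, tz = "UTC")
coffee_base
```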
thicken parameters:
x
interval
colname = NULL
rounding = c("down", "up")
by = NULL
start_val = NULL
The pad function takes in a data frame, then it inserts a record for every time point that lacks an observation, with NA values for the other variables.
coffee %>%
  thicken(interval = "day", colname = "d") %>%
  count(d) %>%
  pad()
## pad applied on the interval: day
## # A tibble: 4 x 2
##            d     n
##       <date> <int>
## 1 2016-07-07     2
## 2 2016-07-08    NA
## 3 2016-07-09     1
## 4 2016-07-10     1
pad parameters:
x
interval = NULL
start_val = NULL
end_val = NULL
by = NULL
group = NULL
break_above = 1
Padding can thus also be done within a grouping variable.
emergency %>%
  thicken('month', colname = "m") %>%
  count(title, m) %>%
  pad(group = "title",
      start_val = as.Date("2015-12-01"),
      end_val = as.Date("2016-10-01"))
## pad applied on the interval: month
## Source: local data frame [1,287 x 3]
## Groups: title [117]
##
##                   title          m     n
##                   <chr>     <date> <int>
## 1  EMS: ABDOMINAL PAINS 2015-12-01   128
## 2  EMS: ABDOMINAL PAINS 2016-01-01   186
## 3  EMS: ABDOMINAL PAINS 2016-02-01   161
## 4  EMS: ABDOMINAL PAINS 2016-03-01   184
## 5  EMS: ABDOMINAL PAINS 2016-04-01   185
## 6  EMS: ABDOMINAL PAINS 2016-05-01   162
## 7  EMS: ABDOMINAL PAINS 2016-06-01   158
## 8  EMS: ABDOMINAL PAINS 2016-07-01   143
## 9  EMS: ABDOMINAL PAINS 2016-08-01   176
## 10 EMS: ABDOMINAL PAINS 2016-09-01   174
## # ... with 1,277 more rows
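Padding within groups boils down to giving every group the same full range of time points. Below is a base R sketch on a made-up toy data set; the data and the names counts, grid and padded_groups are hypothetical, and this is not how padr does it internally.

```r
# Toy counts per group and month (made-up data).
counts <- data.frame(
  title = c("A", "A", "B"),
  m     = as.Date(c("2016-01-01", "2016-03-01", "2016-02-01")),
  n     = c(3L, 5L, 2L),
  stringsAsFactors = FALSE
)

# The full range of months every group should cover.
months <- seq(as.Date("2016-01-01"), as.Date("2016-03-01"), by = "month")

# One row per group and month.
grid <- do.call(rbind, lapply(unique(counts$title), function(g) {
  data.frame(title = g, m = months, stringsAsFactors = FALSE)
}))

# Left join: group and month combinations without a count get NA.
padded_groups <- merge(grid, counts, by = c("title", "m"), all.x = TRUE)
padded_groups
```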
After padding you are left with the missing values for the imputed records.
padded_df <- data.frame(dt  = as.Date(c("2017-02-23", "2017-02-25", "2017-02-28")),
                        val = c(2, 4, 2)) %>%
  pad()
## pad applied on the interval: day
padded_df
##           dt val
## 1 2017-02-23   2
## 2 2017-02-24  NA
## 3 2017-02-25   4
## 4 2017-02-26  NA
## 5 2017-02-27  NA
## 6 2017-02-28   2
Depending on the nature of the data you might want to:
Carry the last value forward
padded_df %>% tidyr::fill(val)
##           dt val
## 1 2017-02-23   2
## 2 2017-02-24   2
## 3 2017-02-25   4
## 4 2017-02-26   4
## 5 2017-02-27   4
## 6 2017-02-28   2
Fill all the missing values with the same value
padded_df %>% fill_by_value(val, value = 42)
##           dt val
## 1 2017-02-23   2
## 2 2017-02-24  42
## 3 2017-02-25   4
## 4 2017-02-26  42
## 5 2017-02-27  42
## 6 2017-02-28   2
Fill all the missing values with a function of the non-missing values
padded_df %>% fill_by_function(val, fun = mean)
##           dt      val
## 1 2017-02-23 2.000000
## 2 2017-02-24 2.666667
## 3 2017-02-25 4.000000
## 4 2017-02-26 2.666667
## 5 2017-02-27 2.666667
## 6 2017-02-28 2.000000
Fill all the missing values with the most prevalent of the non-missing values
padded_df %>% fill_by_prevalent(val)
##           dt val
## 1 2017-02-23   2
## 2 2017-02-24   2
## 3 2017-02-25   4
## 4 2017-02-26   2
## 5 2017-02-27   2
## 6 2017-02-28   2
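For comparison, the four strategies can be written out in base R on the val column alone. This is a sketch of what each strategy computes, not the code behind tidyr::fill or the fill_by_* functions; all variable names are made up for the example.

```r
val <- c(2, NA, 4, NA, NA, 2)
missing <- is.na(val)

# Carry the last value forward (assumes the first value is not missing).
last_seen <- cummax(seq_along(val) * !missing)  # index of the last non-missing value
locf <- val[last_seen]

# The same value everywhere.
by_value <- replace(val, missing, 42)

# A function of the non-missing values, here the mean.
by_mean <- replace(val, missing, mean(val, na.rm = TRUE))

# The most prevalent non-missing value (table() drops NA by default).
prevalent <- as.numeric(names(which.max(table(val))))
by_prevalent <- replace(val, missing, prevalent)
```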
library(ggplot2)
animal_bites_plot <- emergency %>%
  filter(title == 'EMS: ANIMAL BITE') %>%
  thicken(interval = 'day', colname = 'ts_day') %>%
  count(ts_day) %>%
  pad() %>%
  fill_by_value(n) %>%
  ggplot(aes(ts_day, n)) +
  geom_point() +
  geom_line() +
  geom_smooth()
## pad applied on the interval: day
animal_bites_plot
There are two vignettes, a general introduction and more details on the implementation.
vignette("padr")
vignette("padr_implementation")
I blog about changes in padr on: thats-so-random.com
And the package is maintained on: github.com/EdwinTh/padr